ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT

Authors

  • Liane Guillou
  • Christian Hardmeier
  • Aaron Smith
  • Jörg Tiedemann
  • Bonnie L. Webber
Abstract

We present ParCor, a parallel corpus of texts in which pronoun coreference – reduced coreference in which pronouns are used as referring expressions – has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and ultimately for developing and testing informed Statistical Machine Translation systems aimed at addressing the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. We provide details of the texts that we selected, the guidelines and tools used to support annotation and some corpus statistics. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future.

Similar resources

Improving Pronoun Translation by Modeling Coreference Uncertainty

Information about the antecedents of pronouns is considered essential to solve certain translation divergencies, such as those concerning the English pronoun it when translated into gendered languages, e.g. for French into il, elle, or several other options. However, no machine translation system using anaphora resolution has so far been able to outperform a phrase-based statistical MT baseline...

Translation of "It" in a Deep Syntax Framework

We present a novel approach to the translation of the English personal pronoun it to Czech. We conduct a linguistic analysis on how the distinct categories of it are usually mapped to their Czech counterparts. Armed with these observations, we design a discriminative translation model of it, which is then integrated into the TectoMT deep syntax MT framework. Features in the model take advantage...

Using Coreference Links to Improve Spanish-to-English Machine Translation

In this paper, we present a proof-of-concept of a coreference-aware decoder for document-level machine translation. We consider that better translations should have coreference links that are closer to those in the source text, and implement this criterion in two ways. First, we define a similarity measure between source and target coreference structures, by projecting the target ones onto the s...

A Document-Level SMT System with Integrated Pronoun Prediction

This paper describes one of Uppsala University’s submissions to the pronoun-focused machine translation (MT) shared task at DiscoMT 2015. The system is based on phrase-based statistical MT implemented with the document-level decoder Docent. It includes a neural network for pronoun prediction trained with latent anaphora resolution. At translation time, coreference information is obtained from th...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Publication date: 2014